ggplot2 Graphics

The package ggplot2 by Hadley Wickham implements graphs based on the grammar of graphs ideas. The concept for this plotting framework was laid out by Leland Wilkinson in his 1999 book Grammar of Graphics.

It has two main principles:

There are seven grammatical elements in total, three are essential:

The non-essential ones include:

A good reference for ggplot can be found at http://ggplot2.tidyverse.org/reference/.

Principles of ggplot2

All plots are generated with the same function: ggplot(). However, this function only initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.

ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())

It has two main arguments:

data Default dataset to use for plot. If not already a data.frame, will be converted to one by fortify. If not specified, must be suppled in each layer added to the plot.

mapping Default list of aesthetic mappings to use for plot. If not specified, must be suppled in each layer added to the plot.

> library(ggplot2)
> p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

There are three common ways to invoke ggplot():

ggplot(df, aes(x, y, <other aesthetics>))

ggplot(df)

ggplot()

The first method is recommended if all layers use the same data and the same set of aesthetics, although this method can also be used to add a layer using data from another data frame. The second method specifies the default data frame to use for the plot, but no aesthetics are defined up front. This is useful when one data frame is used predominantly as layers are added, but the aesthetics may vary from one layer to another. The third method initializes a skeleton ggplot object which is fleshed out as layers are added. This method is useful when multiple data frames are used to produce different layers, as is often the case in complex graphics.

In order to see something, we have add “layers”. These are added by using the + operator. Behind the scenes it calls the layer() function to create a new layer. You will never use the layer function, but it is good for your understanding of how ggplot2 works.

p + layer(
  mapping = NULL, 
  data = NULL,
  geom = "point", geom_params = list(),
  stat = "identity", stat_params = list(),
  position = "identity"
)

This call fully specifies the five components to the layer:

It’s useful to understand the layer() function so you have a understanding of the layer object. But you’ll rarely use the full layer() call because it’s so verbose. Instead, you’ll use the shortcut geom_ functions: geom_point(mapping, data, ...) is exactly equivalent to layer(mapping, data, geom = "point", ...).

> p + geom_point()

Data

Every layer must have some data associated with it, and that data must be in a data frame. However, the data on each layer doesn’t need to be the same, and it’s often useful to combine multiple datasets in a single plot. Let us illustrate this idea:

> mod <- loess(hwy ~ displ, data = mpg)
> grid <- data.frame(displ = seq(min(mpg$displ), max(mpg$displ), length = 50))
> grid$hwy <- predict(mod, newdata = grid)
> std_resid <- resid(mod) / mod$s
> outlier <- subset(mpg, abs(std_resid) > 2)
> ggplot(mpg, aes(displ, hwy)) + 
+   geom_point() + 
+   geom_line(data = grid, colour = "blue", size = 1.5) + 
+   geom_text(data = outlier, aes(label = model))

Aesthetics

The aesthetic mappings, defined with aes(), describe how variables are mapped to visual properties or aesthetics. In simpler words, an aesthetic is “something you can see”. They include

  • position (i.e., on the x and y axes)
  • color
  • fill (the “inside” of a plotting character)
  • shape (of points)
  • linetype
  • size

aes() takes a sequence of aesthetic-variable pairs like this:

> aes(x = displ, y = hwy, colour = class)

Here we map x-position to displ, y-position to hwy, and colour to class. The names for the first two arguments can be ommitted, in which case they correspond to the x and y variables. That makes this specification equivalent to the one above:

> aes(displ, hwy, colour = class)

Aesthetic mappings can be supplied in the initial ggplot() call, in individual layers, or in some combination of both. All of these calls create the same plot specification:

> ggplot(mpg, aes(displ, hwy, colour = class)) + 
+   geom_point()
> ggplot(mpg, aes(displ, hwy)) + 
+   geom_point(aes(colour = class))
> ggplot(mpg, aes(displ)) + 
+   geom_point(aes(y = hwy, colour = class))
> ggplot(mpg) + 
+   geom_point(aes(displ, hwy, colour = class))

The aesthetics set in the ggplot() function are the default for all other layers. You can, however, add, override, or even remove mappings (e.g. aes(y=NULL)) in the layers.

> par(mfrow=c(1,2))
> ggplot(mpg, aes(displ, hwy, colour = class)) + 
+   geom_point() + 
+   geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess'

> ggplot(mpg, aes(displ, hwy)) + 
+   geom_point(aes(colour = class)) + 
+   geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess'

Geoms

Geometric objects are the actual marks we put on a plot. We have already seen a few examples of geoms: geom_points, geom_line, and geom_smooth, however, there are many more. You can get a list of available geometric objects by the help.search() command:

> help.search("geom_", package = "ggplot2")

An excellent overview can be found at [http://sape.inf.usi.ch/quick-reference/ggplot2/geom]

There are a few rules to using geoms:

  • A plot must have at least one geom; there is no upper limit.
  • You can add a geom to a plot using the + operator
  • Each type of geom accepts only a subset of all aesthetics - refer to the geom help ape to see what mappings each geom accepts. E.g. geom_point requires mappings for x and y, all others are optional. The geom_text accepts a labels mapping.

Statistical Transformations

Some plot types (such as scatterplots) do not require transformations–each point is plotted at x and y coordinates equal to the original value. Other plots, such as boxplots, histograms, prediction lines etc. require statistical transformations: For a boxplot the y values must be transformed to the median and 1.5(IQR). For a histogram, the x values must be transformed to the bin.

Each geom has a default statistic, but these can be changed. For example, the default statistic for geom_bar is stat_bin.

> args(geom_histogram)
function (mapping = NULL, data = NULL, stat = "bin", position = "stack", 
    ..., binwidth = NULL, bins = NULL, na.rm = FALSE, show.legend = NA, 
    inherit.aes = TRUE) 
NULL
> args(stat_bin)
function (mapping = NULL, data = NULL, geom = "bar", position = "stack", 
    ..., binwidth = NULL, bins = NULL, center = NULL, boundary = NULL, 
    breaks = NULL, closed = c("right", "left"), pad = FALSE, 
    na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) 
NULL

Arguments to stat_ functions can be passed through geom_ functions. This can be slightly annoying because in order to change it you have to first determine which stat the geom uses, then determine the arguments to that stat.

> housing <- read.csv("/home/Fabian2/Desktop/StatCompMeta/Rgraphics/dataSets/landdata-states.csv")
> head(housing[1:5])
  State region    Date Home.Value Structure.Cost
1    AK   West 2010.25     224952         160599
2    AK   West 2010.50     225511         160252
3    AK   West 2009.75     225820         163791
4    AK   West 2010.00     224994         161787
5    AK   West 2008.00     234590         155400
6    AK   West 2008.25     233714         157458
> 
> p2 <- ggplot(housing, aes(x = Home.Value))
> p2 + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

by changing the binwidth argument in the stat_bin function we can obtain:

> p2 + geom_histogram(stat = "bin", binwidth=4000)

Coordinates

The coordinate system of a plot, together with the x and y position scale, determines the location of geoms. There are a few options:

  • coord_cartesian - (default) cartesian coordinate system (x horizontal from left to right, y vertical from bottom to top)
  • coord_flip - flipped cartesian coordinate system (x vertical from bottom to top, y horizontal from left to right)
  • coord_trans - transformed cartesion coordinate system, e.g. log transform
  • coord_equal - forces a fixed ratio of the physical representations on the x and y axis
  • coord_polar - polar coordinate system; the x (or y) scale is mapped to the angle (theta)
  • coord_map - various map projections
> d <- data.frame(height = c(1,2,2,3,4), weight = c(1,3,4,4,2))
> p <- ggplot() +
+   geom_line(data=d, mapping=aes(x=height, y=weight)) +
+   geom_point(data=d, mapping=aes(x=height, y=weight), size=8, fill="white", shape=21) +
+   geom_text(data=d,mapping=aes(x=height, y=weight, label=seq(1,nrow(d))))
> p + coord_cartesian()

> p + coord_flip()

> p + coord_trans(x="log10", y="log10")

> p + coord_equal()

> p + coord_polar(theta="x")

> p + coord_polar(theta="y")

Scales: Controlling Aesthetic Mapping

Aesthetic mapping (i.e., with aes()) only says that a variable should be mapped to an aesthetic. It doesn’t say how that should happen. For example, when mapping a variable to shape with aes(shape = x) you don’t say what shapes should be used. Similarly, aes(color = z) doesn’t say what colors should be used. Describing what colors/shapes/sizes etc. to use is done by modifying the corresponding scale. In ggplot2 scales include

  • position
  • color and fill
  • size
  • shape
  • line type

Scales are modified with a series of functions using a scale_<aesthetic>_<type> naming scheme. To get an overview type

> help.search("scale_", package = "ggplot2")

Let us consider an example of a dopplot showing the distribution of home values by Date and State.

> 
> p3 <- ggplot(housing,
+              aes(x = State,
+                  y = Home.Price.Index)) + 
+         theme(legend.position="top",
+               axis.text=element_text(size = 6))
> (p4 <- p3 + geom_point(aes(color = Date),
+                        alpha = 0.5,
+                        size = 1.5,
+                        position = position_jitter(width = 0.25, height = 0)))

Now, let us modify the breaks for the x axis and color scales.

> p4 + scale_x_discrete(name="State Abbreviation") +
+   scale_color_continuous(name="",
+                          breaks = c(1976, 1994, 2013),
+                          labels = c("'76", "'94", "'13"))

Next, change the low and high values to blue and red.

> p4 +
+   scale_x_discrete(name="State Abbreviation") +
+   scale_color_continuous(name="",
+                          breaks = c(1976, 1994, 2013),
+                          labels = c("'76", "'94", "'13"),
+                          low = "blue", high = "red")

> library(scales)
> p4 +
+   scale_color_continuous(name="",
+                          breaks = c(1976, 1994, 2013),
+                          labels = c("'76", "'94", "'13"),
+                          low = muted("blue"), high = muted("red"))

Faceting

Faceting allows to create multiple graphs for subsets of data. ggplot2 offers two functions for creating small multiples

  • facet_wrap(): define subsets as the levels of a single grouping variable
  • facet_grid(): define subsets as the crossing of two grouping variables
> p5 <- ggplot(housing, aes(x = Date, y = Home.Value))
> p5 + geom_line(aes(color = State))

There are two problems here. First, there are too many states to distinguish each one by color, and secondly, the lines obscure one another.

> (p5 <- p5 + geom_line() + 
+   facet_wrap(~State, ncol=10))

Themes

The ggplot2 theme system handles non-data elements such as

All theme options are documented in ?theme. There are a number of built-in themes, e.g.

> p5 + theme_linedraw()

> p5 + theme_light()

Specific theme elements can be overridden using theme().

> p5 + theme_minimal() + 
+   theme(text = element_text(color = "turquoise"))

You can create new themes, as in the the following example:

> theme_new <- theme_bw() +
+   theme(plot.background = element_rect(size = 1, color = "blue", fill = "grey"),
+         text=element_text(size = 12, family = "Serif", color = "ivory"),
+         axis.text.y = element_text(colour = "pink"),
+         axis.text.x = element_text(colour = "red"),
+         panel.background = element_rect(fill = "pink"),
+         strip.background = element_rect(fill = "pink"))
>         
> p5 + theme_new

Examples of ggplot2 plots

Densityplot of a univarite continuous variables

> vec <- data.frame("x"=rnorm(100))
> ggplot(vec, aes(x=x)) + 
+   geom_histogram(aes(y=..density..)) + 
+   geom_rug() +
+   stat_function(fun = dnorm, colour = "red",
+                 args = list(mean = mean(vec$x), sd = sd(vec$x)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histogram

> ggplot(data = diamonds) + 
+   geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

QQ Plot

> vec <- data.frame("x"=rnorm(100))
> slope <- diff(quantile(vec$x, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75)))
> int <- quantile(vec$x, 0.25) - slope * qnorm(0.25)
> ggplot(vec, aes(sample=x)) + 
+   stat_qq() + 
+   geom_abline(aes(slope = slope, intercept = int), col="red")

Grouped Dotplot

> library(dplyr)
> clss <- mpg %>%
+   group_by(class) %>%
+   dplyr::summarise(n = n(), hwy = mean(hwy))
> 
> clss$n.spell <- paste0("n = ", clss$n)
> 
> ggplot(data = mpg, aes(y = hwy, x = class)) + 
+   geom_jitter() + 
+   geom_point(data = clss, aes(x = class, y = hwy), col = "red", size = 4) + 
+   geom_text(data = clss, aes(label = n.spell, y = 10 ))

Univariate categorical data

> ggplot(data = diamonds) +
+   geom_bar(mapping = aes(x = cut))

Comparison between baseR and gglot2

Compared to base graphics, ggplot2

  • is more verbose for simple graphics
  • is less verbose for complex / custom graphics
  • does not have methods! (data should always be in a data frame)
  • does not include 3-dimensional graphics (see the rgl package)
  • does not include Graph-theory type graphs (see the igraph package)
  • does not include interactive graphics (see the ggvis)

A good way to compare to the two philosophies is to create the same plots with both aproaches. Let us have a look at a few excellent examples taken from the blog “Tufte in R” by Lukasz Piwek to be found under .

Minimal line plot in base graphics

> x <- 1967:1977
> y <- c(0.5,1.8,4.6,5.3,5.3,5.7,5.4,5,5.5,6,5)
> plot(y ~ x, axes=F, xlab="", ylab="", pch=16, type="b")
> axis(1, at=x, label=x, tick=F, family="serif")
> axis(2, at=seq(1,6,1), label=sprintf("$%s", seq(300,400,20)), tick=F, las=2, family="serif")
> abline(h=6,lty=2)
> abline(h=5,lty=2)
> text(max(x), min(y)*2.5,"Per capita\nbudget expanditures\nin constant dollars", adj=1, 
+      family="serif")
> text(max(x), max(y)/1.08, labels="5%", family="serif")

Minimal line plot in ggplot2

> library(ggplot2)
> library(ggthemes)
> x <- 1967:1977
> y <- c(0.5,1.8,4.6,5.3,5.3,5.7,5.4,5,5.5,6,5)
> d <- data.frame(x, y)
> ggplot(d, aes(x,y)) + geom_line() + geom_point(size=3) + theme_tufte(base_size = 15) +
+   theme(axis.title=element_blank()) + geom_hline(yintercept = c(5,6), lty=2) + 
+   scale_y_continuous(breaks=seq(1, 6, 1), label=sprintf("$%s",seq(300,400,20))) + 
+   scale_x_continuous(breaks=x,label=x) +
+   annotate("text", x = c(1977,1977.2), y = c(1.5,5.5), adj=1,  family="serif",
+            label = c("Per capita\nbudget expandures\nin constant dollars", "5%"))

Marginal histogram scatterplot with base graphics

> library(devtools)
> source_url("https://raw.githubusercontent.com/sjmurdoch/fancyaxis/master/fancyaxis.R")
SHA-1 hash of file is 1eaf278aced5a1206f1c0bffab087d26f44e3f73
> x <- faithful$waiting
> y <- faithful$eruptions
> plot(x, y, main="", axes=FALSE, pch=16, cex=0.8,
+      xlab="Time till next eruption (min)", ylab="Duration (sec)", 
+      xlim=c(min(x)/1.1, max(x)), ylim=c(min(y)/1.5, max(y)))
> axis(1, tick=F)
> axis(2, tick=F, las=2)
> axisstripchart(faithful$waiting, 1)
> axisstripchart(faithful$eruptions, 2)

Marginal histogram scatterplot with ggplot2

> library(ggplot2)
> library(ggExtra)
> library(ggthemes)
> p <- ggplot(faithful, aes(waiting, eruptions)) + geom_point() + theme_tufte(ticks=F)
> ggMarginal(p, type = "histogram", fill="transparent")

Slopegraph in base graphics

> library(devtools)
> #install_github("leeper/slopegraph", force = TRUE)#install Leeper's package from Github
> library(slopegraph)
> data(cancer)
> slopegraph(cancer, col.lines = 'gray', col.lab = 1, col.num = 1,
+            xlim = c(-.2,5),
+            main = "Estimate of % survival rates",
+            xlabels = c('5 Year','10 Year','15 Year','20 Year'))

Slopegraph in ggplot2

> library(ggplot2)
> library(ggthemes)
> library(devtools)
> library(RCurl)
> library(plyr)
> source_url("https://raw.githubusercontent.com/jkeirstead/r-slopegraph/master/slopegraph.r")
SHA-1 hash of file is 2b0f7bf51a5e18d2ea638a502ef449b6e654d27a
> d <- read.csv(text = getURL("https://raw.githubusercontent.com/jkeirstead/r-slopegraph/master/cancer_survival_rates.csv"))
> df <- build_slopegraph(d, x="year", y="value", group="group", method="tufte", min.space=0.04)
> df <- transform(df, x=factor(x, levels=c(5,10,15,20),
+                               labels=c("5 years","10 years","15 years","20 years")), y=round(y))
>  plot_slopegraph(df) + labs(title="Estimates of % survival rates") +
+    theme_tufte(base_size=16, ticks=F) + theme(axis.title=element_blank())
Loading required package: grid
Warning: `axis.ticks.margin` is deprecated. Please set `margin` property of
`axis.text` instead
Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
instead